Spoken Tunisian Arabic Corpus "STAC": Transcription and Annotation

Authors

  • Inès Zribi
  • Mariem Ellouze
  • Lamia Hadrich Belguith
  • Philippe Blache
Abstract

Corpora are an important resource for natural language processing (NLP). Currently, Dialectal Arabic corpora are somewhat limited, particularly for Tunisian Arabic. In recent years, since the events of the revolution, the increasing presence of spoken Tunisian Arabic in interviews, news, and debate programs, the increasing use of language technologies for many spoken languages (e.g., Siri) [6], and the need for work on speech technologies require a large amount of well-designed Tunisian spoken corpora. This paper presents “STAC” (Spoken Tunisian Arabic Corpus), a corpus of spontaneous Tunisian Arabic speech. We present the method used to collect and transcribe this corpus. Then, we detail the stages carried out to enrich the corpus with the linguistic and speech annotations that make it more useful for many NLP applications.


Similar resources

Domain-related Annotation of Polish Spoken Dialogue Corpus LUNA.PL

In this paper we present a corpus of Polish spoken dialogues annotated on several levels, from transcription of dialogues and their morphosyntactic analysis to semantic annotation. The corpus is one of the results of the LUNA project. The description concentrates on semantic annotation at the levels of concepts (attribute-value) and predicates (frame sets).


Grammatical Error Annotation for Korean Learners of Spoken English

The goal of our research is to build a grammatical error-tagged corpus for Korean learners of Spoken English dubbed Postech Learner Corpus. We collected raw story-telling speech from Korean university students. Transcription and annotation using the Cambridge Learner Corpus tagset were performed by six Korean annotators fluent in English. For the annotation of the corpus, we developed an annota...


Annotation of Polish spoken dialogs in LUNA project

In this paper we present the general assumptions and goals of the LUNA (spoken Language UNderstanding in multilinguAl communication systems) project. We describe the process of collecting a Polish corpus of spoken dialogs and the annotation schema adopted for this corpus at several levels, from transcription of dialogs and morphosyntactic analysis to semantic and dialog act annotation.


Semi-automatic Domain Ontology Construction from Spoken Corpus in Tunisian Dialect: Railway Request Information

In this paper, we present a hybrid method for semi-automatically building a domain ontology from a spoken dialogue corpus in Tunisian Dialect for the railway request information domain. The proposed method is based on a statistical method for term and concept extraction and a linguistic method for semantic relation extraction. This method consists of three fundamental phases, namely the corpus const...


Tunisian dialect Wordnet creation and enrichment using web resources and other Wordnets

In this paper, we propose TunDiaWN (Tunisian dialect Wordnet), a lexical resource for the dialect spoken in Tunisia. Our TunDiaWN construction approach is founded, on the one hand, on a corpus-based method to analyze and extract Tunisian dialect words. A clustering technique is adapted and applied to mine the possible relations existing between the extracted Tunisian dialect words and to gr...



Journal:
  • Research in Computing Science

Volume 90


Publication date: 2015